This project used proteomic data from mass spectrometry, aiming to find out how the phosphorylation of specific proteins in signaling pathways can impact cell differentiation under given conditions. After data manipulation, statistical process, and analysis among databases, associations between special proteins and possible pathways with biological activities were established.
Professor Carolina Vieira and her colleagues from the University of Brasília want to know what makes the pluripotent stem cells able to return to an undifferentiated state from a differentiated state, under specific conditions. Post-translation modification can be critical in this process. After treating cells under different conditions to control their pace of differentiation, data were generated by mass spectrometry. The dataset containing only the Phosphorylated proteins is referred to as Phospho, and the whole proteomic dataset with no enrichment of phosphorylation is called Proteome. The target of this project is to obtain a list containing the names of the proteins that increase in phosphorylation and interpret the result.
Workflow
Non-differentiated C2C12 mouse muscle cells were cultured in growth media with horse serum to induce differentiation. Leukemia inhibitory factor (LIF) was applied to one group of cells to inhibit differentiation, while the other group (Control) was cultured without LIF.
Before running on mass spectrometry, the protein samples were fractionated. By matching the generated data to the FASTA database of mouse proteins, the peptides and proteins were identified and quantified. Furthermore, aliquots of the two samples are enriched for phosphorylated peptides.
To know if one protein has a higher level of phosphorylation in the LIF-treated sample vs. the control sample, we have to consider both datasets. Only when a protein in the Proteome dataset has no difference between LIF-treated and Control, meanwhile the same protein expressed differently between these two conditions, such protein would be considered differentially phosphorylated.
Conditions to define “more phosphorylated”
The codes and command lines used in this project were separated into several blocks with comments and annotation according to the operation steps in the instruction.
phospho <- read.csv("./Dataset/Dataset_Phospho.csv" , header = T , sep = ";" , stringsAsFactors = F , fileEncoding = "UTF-8-BOM")
proteome <- read.csv("./Dataset/Dataset_Proteome.csv" , header = T , sep = ";" , stringsAsFactors = F , fileEncoding = "UTF-8-BOM")
The two .csv files were read into two data frames, phospho and proteome respectively.
fileEncoding=“UTF-8-BOM” is to remove “ï..” of the first column.
phospho$Abundance.Control.Average <- rowMeans(phospho[ ,c("Abundance.Control.1" , "Abundance.Control.2" , "Abundance.Control.3")] , na.rm = TRUE)
phospho$Abundance.LIF.Average <- rowMeans(phospho[ ,c("Abundance.LIF.1" , "Abundance.LIF.2" , "Abundance.LIF.3")] , na.rm = TRUE)
Two new columns in phospho, “Abundance.Control.Average” and “Abundance.LIF.Average”, were appended. The arithmetic mean of LIF and Control group for each row were calculated by the function “rowMeans”. na.rm=TRUE to omit the row with “NA” when calculating.
phospho$ratio <- phospho$Abundance.LIF.Average / phospho$Abundance.Control.Average
A new “ratio” column was appended. The value of each row is the ratio of the mean of LIF and Control.
temp <- phospho[ ,c(10:15)]
temp <- na.omit(temp)
Create a temporary data frame for following Student’s t test. This temporary data frame contains only Abundance Control 1-3 and Abundance LIF 1-3 without NA. To run the t test properly, NA was removed ahead of schedule in order, which should be done in step 3a.
LIF <- grep("LIF" , colnames(temp) , value = TRUE)
Control <- grep("Control", colnames(temp) , value = TRUE)
p <- apply(temp ,
1 ,
function(x)
t.test(x[LIF] ,
x[Control])$p.value)
phospho_no_na <- na.omit(phospho)
phospho_no_na$pvalue <- p
Create two vectors containing the column headers including “LIF” and “Control” respectively. Function “apply” was used to apply function “t.test” on each row of the temporary data frame. Function “t.test” was performed between the “LIF” group and the “Control” group according to the temporary vectors created previously. After the t test, the p.value was stored in a vector. This vector was appended to a new phospho data frame without NA as a column.
Empty cells and NAs have already been removed in previous step.
library(dplyr)
More_Phosphorylated <- filter(phospho_no_na , ratio >= 2)
Package “dplyr” was loaded to use its function “filter”. Function “filter” was used to filter in the rows have value greater than 2 in ratio column from phospho_no_na to A new More_Phosphorylated data frame.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("STRINGdb")
Download and install the “STRINGdb” package if it doesn’t exist in the current system
library(STRINGdb)
Run the “STRINGdb” package after installation
string_db <- STRINGdb$new(version = "11" , species = 10090)
Create an object link to the STRING database for the following search.
This searching will use stringdb version 11 on Mus musculus (Mouse, Taxon identifier: 10090)
An optional arguments, score_threshold=400 are omitted to set as default for first attempt.
mapped <- string_db$map(More_Phosphorylated , "Master.Protein.Accessions" , removeUnmappedRows = TRUE )
Map(Search) the entries in “Master.Protein.Accessions” column of “More_Phosphorylated” data frame in the STRING database.
A column “STRING_id” appended.
The “removeUnmappedRows” set as True to remove the rows that cannot be mapped to STRING. By default, the argument was FALSE, those unmapped lines are left and their STRING_id is set to NA.
hits <- mapped$STRING_id
Create a vector called “hits”, which contains the mapped STRING_id.
string_db$plot_network( hits )
According to the search result, an interaction plot of the listed protein was generated.
Protein interaction plot of “More_Phosphorylated”
Legend of protein interaction plots
proteome_gt_005 <- filter(proteome , Abundance.Ratio.P.Value...LIF.....Control. > 0.05)
proteome_to_match <- proteome_gt_005
phospho_to_match <- More_Phosphorylated
names(phospho_to_match)[1] <- "Accession"
The entries in the proteome dataset with p.value greater than 0.05 were filtered in a new data frame “proteome_gt_005”
Create 2 data frames containing the existing data to avoid confounding, since the following step is to rename the column name to making them identical between two data frames, make it available for the next step.
MORE_More_Phosphorylated <- inner_join(proteome_to_match , phospho_to_match , by = "Accession")
mapped2 <- string_db$map( MORE_More_Phosphorylated , "Accession" , removeUnmappedRows = TRUE )
hits2 <- mapped2$STRING_id
string_db$plot_network( hits2 )
Function “inner_join” from package “dplyr” was used to merge the two data frames by column “Accession”. As the illustration below, this function will keep the elements from both data frames.
How inner_join works
After the merging step. The similar commands to generate protein interaction plots were run again for a narrowed list.
MORE_More_Phosphorylated
Stat5b has direct interaction with Ptpn11, Numa1, and Stat5a when searching in a loose setting.
string_db2 <- STRINGdb$new(version = "11" , species = 10090 , score_threshold = 700)
mapped3 <- string_db2$map(MORE_More_Phosphorylated , "Accession" , removeUnmappedRows = TRUE )
hits3 <- mapped3$STRING_id
string_db2$plot_network( hits3 )
However, Only Stat5a, Ptpn11, Gab1, Sos1 retain strong interaction with Stat5b under stringent condition(score_threshold=700).
Five interaction proteins with strong confidence
write.table(More_Phosphorylated$Master.Protein.Accessions , file = "More_Phosphorylated.txt" , quote = F, row.names = F, col.names = F)
write.table(MORE_More_Phosphorylated$Accession , file = "MORE_More_Phosphorylated(after_match).txt" , quote = F , row.names = F , col.names = F)
An alternative query can be done in STRING
Protein-interactions images under different settings can be found above.
Functional enrichment analysis (also known as Gene set enrichment analysis, GSEA) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes. There are many online tools to perform Functional Enrichment Analysis or similar analysis, such as KEGG and GENE ONTOLOGY. Also, Enrichr, DAVID, and NetworkAnalyst have developed websites to integrate different databases allowing users to generate and compare the result in several ways. The STRING protein interaction database we used previously also provides a crosslink function to GENE ONTOLOGY and KEGG to do functional enrichments in the result page when you do the search on webpage.
Some inferences that can be made after running the filtered out gene list on different databases. Chronic myeloid leukemia(mmu05220), ErbB signaling pathway(mmu04012), and Jak-STAT signaling pathway(mmu04630) turn to be the 3 most enriched pathways in KEGG 2019 MOUSE database. This result coincides with the STRINGdb protein interaction result.
The 5 proteins (Stat5b, Stat5a, Ptpn11, Gab1, Sos1) retain interaction under stringent settings (score_threshold=700) play exact important roles in these 3 pathways. Ptpn11 is in Chronic myeloid leukemia and Gab1 is in ErbB signaling pathways, while the other 3 gene involves both CML and ErbB signaling pathway.
In GENE ONTOLOGY database, these 5 genes were analyzed in 3 categories. In Biological Process(BP), the 5 genes are enriched in Positive regulation of mitotic cell cycle(GO:0045931), Regulation of multicellular organism growth(GO:0040014), and Epidermal growth factor receptor signaling pathway(GO:0007173). Epidermal growth factor receptor (EGFR) is also known as Erbb or Erbb1. In mice, binding of the ligand of EGFR triggers homo- and/or heterodimerization of the receptor triggering its autophosphorylation, heterodimer with ERBB2. In Molecular Function(MF), these genes involve protein binding(GO:0005515) and signal transducer activity(GO:0004871) These 2 terms show with FDR around 0.1 even they are top 2 hits in this category. In Cell Component(CC), they all exist in cytosol(GO:0005829)
| Category | Term | Description | Genes | PValue | FDR |
|---|---|---|---|---|---|
| GO_BP_DIRECT | GO:0045931 | positive regulation of mitotic cell cycle | STAT5A, STAT5B, PTPN11 | 0.0000182 | 0.0013764 |
| GO_BP_DIRECT | GO:0040014 | regulation of multicellular organism growth | STAT5A, STAT5B, PTPN11 | 0.0000257 | 0.0013764 |
| GO_BP_DIRECT | GO:0007173 | epidermal growth factor receptor signaling pathway | GAB1, PTPN11, SOS1 | 0.0000362 | 0.0013764 |
| GO_MF_DIRECT | GO:0005515 | protein binding | STAT5A, STAT5B, GAB1, PTPN11, SOS1 | 0.0030232 | 0.0876737 |
| GO_MF_DIRECT | GO:0004871 | signal transducer activity | STAT5A, STAT5B, GAB1 | 0.0078629 | 0.1140126 |
| GO_CC_DIRECT | GO:0005829 | cytosol | STAT5A, STAT5B, PTPN11, SOS1 | 0.0027806 | 0.0361478 |
| KEGG_PATHWAY | mmu05220 | Chronic myeloid leukemia | STAT5A, STAT5B, PTPN11, SOS1 | 0.0000031 | 0.0000943 |
| KEGG_PATHWAY | mmu04012 | ErbB signaling pathway | STAT5A, STAT5B, GAB1, SOS1 | 0.0000056 | 0.0000943 |
| KEGG_PATHWAY | mmu04630 | Jak-STAT signaling pathway | STAT5A, STAT5B, PTPN11, SOS1 | 0.0000259 | 0.0002940 |
Amongst the 5428 entries of phospho dataset and 6146 entries of proteome dataset, only 68 entries were retained after the procedure of data cleaning, calculation, filter, and matching. Eight duplications were removed when performing STRING. One protein may break down into different peptides before loading to mass spectrometry, which can be the source of duplications. According to the result (plot and data) from STRING analysis, 5 out of 60 proteins, Stat5b, Stat5a, Ptpn11, Gab1, Sos1, were regarded as key roles.
By running functional enrichment analysis on the 60 filtered entries, the importance of the 5 proteins was confirmed again. These 5 proteins are involved in critical biological activities with a low false discovery rate (FDR), meaning strong confidence. They are all mainly expressed in cytosol, participating in signaling transducer activity and protein binding, especially playing important roles in the regulation of mitotic cell cycle, multicellular organism growth, and EGFR signaling pathway. Such activities highly indicate two possible pathways, chronic myeloid leukemia, and ErbB signaling pathway.
The LIF introduced into culture media has strong impact on ErbB and Jak-STAT pathway via phosphorylation, which is important for differentiation. However, Jak-STAT pathway seems show direct effect from LIF, which affect many members of STAT family, not only Stat5a and Stat5b. Jak-STAT pathway also shows association with MAPK pathway through Ptpn11 and Sos1, leading to the regulation of cell proliferation and differentiation. In other word, the Jak-STAT pathway shows up in the final list might be the result from LIF directly, instead of the true positive result what we looking for. A contrast experiment may need to confirm this hypothesis.
Jak-STAT signaling pathway
Current existing literature concludes not only hematological malignancy, both myeloid and lymphoblastic, but also non-cancerous diseases such as inflammation and T helper cell deficiency. Nevertheless, the result of this project mainly implies malignant hematopoietic disease, especially myeloid. It emphasizes the possible relationship between the malfunction of these target proteins may lead to inappropriate phosphorylation in certain pathways, causing the loss of regulation of hematopoietic stem cells. It is worthy to probe further deeper into this field in the future. In Vitro and in vivo experiments with certain gene modification, or similar manipulations on human cell lines can give more hints about this.
dim(filter(proteome , Abundance.Ratio.P.Value...LIF.....Control. < 0.05))[1]
There are 294 entries with p-value less than 0.05 in proteome dataset.
All adjusted p-value in proteome dataset is greater than 0.85.
For STRINGdb search, 2 different score threshold were applied, 400 for default and 700 for stringent respectively.
The statistical significance threshold(α) for testing in this project was set as 0.05.
In the phospho dataset, there are 87 peptides (79 if remove 8 duplications) were listed in more phosphorylated in step 3b, by filtering the ratio greater than 2.
If match the filter result with the proteome dataset as the criteria mentioned in the introduction section, 68 (60 if remove 8 duplications) should be defined as more phosphorylated for both two datasets.
Chronic myeloid leukemia(mmu05220) and Jak-SATA signaling pathway(mmu04630) include STAT5A, STAT5B, PTPN11, SOS1 and ErbB signaling pathway(mmu04012) includes STAT5A, STAT5B, GAB1, SOS1.
Stat5b, Stat5a, Ptpn11, Gab1, Sos1, these 5 proteins involved in critical biological activities with low false discovery rate in enrichment analysis. It implies they may play important roles in myeloid malignancy development and phosphorylation of ErbB signaling pathway. Details can be found in the Result and Discussion sections above.
https://stackoverflow.com/questions/28383959/error-using-t-test-not-enough-x-observations
https://www.rdocumentation.org/packages/STRINGdb/versions/1.8.1/topics/STRINGdb
https://string-db.org/cgi/input?sessionId=bqoOSokD8OSJ&input_page_show_search=on
https://www.bioconductor.org/packages/release/bioc/vignettes/STRINGdb/inst/doc/STRINGdb.pdf
https://maayanlab.cloud/Enrichr/
https://david.ncifcrf.gov/tools.jsp
https://www.networkanalyst.ca/)
https://www.pnas.org/content/102/43/15545.long
https://www.uniprot.org/uniprot/P42230
https://www.ncbi.nlm.nih.gov/gene/13649
https://www.tandfonline.com/doi/full/10.4161/jkst.25155
https://www.sciencedirect.com/science/article/pii/S1043466609001902?via%3Dihub
https://www.oncotarget.com/article/7128/text/
https://www.mdpi.com/1422-0067/18/9/1904
https://www.frontiersin.org/articles/10.3389/fonc.2017.00218/full